Lena Hammerer, Timo Heiß, Fabian Lulikat
Problemstellung: Gegeben sind Daten über verschiedene Brunnen in Tansania.
Zielstellung: Ziel ist die Vorhersage des Wartungszustands ("Status") der Brunnen mithilfe von Machine Learning Modellen. Die Zielgröße ist damit die Spalte "status_group" mit den Ausprägungen "functional", "functional needs repair" und "non functional". Spezifischer sollen drei Modelle erstellt werden:
Damit handelt es sich um Klassifikationsprobleme (zwei binäre, ein multinominales) des überwachten Lernens.
Metrik: Die Güte der Modelle soll anhand des AUC (Area Under Curve) der Testdaten gemessen werden.
Ziel ist es also ein vollständiges Notebook zu erstellen vom Datenimport, über EDA zur Aufbereitung der Daten, der Modellbildung, Analyse und der Validierung.
df_pump_train = pd.read_csv('pump_train.csv')
df_pump_test = pd.read_csv('pump_test.csv')
df_pump_train.head()
| id | amount_tsh | date_recorded | funder | gps_height | installer | longitude | latitude | wpt_name | num_private | ... | water_quality | quality_group | quantity | quantity_group | source | source_type | source_class | waterpoint_type | waterpoint_type_group | status_group | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12237 | 30.0 | 2013-01-23 | Government Of Tanzania | 107 | District Council | 39.360880 | -10.295705 | Zahanati Membe | 0 | ... | soft | good | enough | enough | river | river/lake | surface | communal standpipe multiple | communal standpipe | functional |
| 1 | 24566 | 0.0 | 2013-02-07 | Private Individual | 0 | Edward | 32.644074 | -3.859265 | Kwa Mama Wini | 0 | ... | soft | good | insufficient | insufficient | shallow well | shallow well | groundwater | other | other | functional |
| 2 | 20536 | 0.0 | 2011-07-15 | Sawaka | 0 | DWE | 30.999792 | -1.721771 | Kwasimon | 0 | ... | soft | good | enough | enough | shallow well | shallow well | groundwater | other | other | non functional |
| 3 | 30633 | 200.0 | 2011-03-25 | Government Of Tanzania | 2142 | Commu | 34.060324 | -9.335288 | none | 0 | ... | soft | good | enough | enough | spring | spring | groundwater | communal standpipe | communal standpipe | functional |
| 4 | 2993 | 0.0 | 2011-02-17 | African Muslim Agency | 290 | District Council | 38.862874 | -7.165410 | Msikitini | 0 | ... | soft | good | dry | dry | machine dbh | borehole | groundwater | hand pump | hand pump | non functional |
5 rows × 41 columns
df_pump_test.head()
| id | amount_tsh | date_recorded | funder | gps_height | installer | longitude | latitude | wpt_name | num_private | ... | water_quality | quality_group | quantity | quantity_group | source | source_type | source_class | waterpoint_type | waterpoint_type_group | status_group | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 69572 | 6000.0 | 2011-03-14 | Roman | 1390 | Roman | 34.938093 | -9.856322 | none | 0 | ... | soft | good | enough | enough | spring | spring | groundwater | communal standpipe | communal standpipe | functional |
| 1 | 8776 | 0.0 | 2013-03-06 | Grumeti | 1399 | GRUMETI | 34.698766 | -2.147466 | Zahanati | 0 | ... | soft | good | insufficient | insufficient | rainwater harvesting | rainwater harvesting | surface | communal standpipe | communal standpipe | functional |
| 2 | 49056 | 0.0 | 2011-02-20 | Private | 62 | Private | 39.209518 | -7.034139 | Mzee Hokororo | 0 | ... | salty | salty | enough | enough | machine dbh | borehole | groundwater | other | other | functional |
| 3 | 58155 | 0.0 | 2011-09-04 | Unicef | 1656 | DWE | 34.569266 | -9.085515 | Kwa Rose Chaula | 0 | ... | soft | good | dry | dry | river | river/lake | surface | communal standpipe | communal standpipe | non functional |
| 4 | 34169 | 0.0 | 2011-07-22 | Hesawa | 1162 | DWE | 32.920154 | -1.947868 | Ngomee | 0 | ... | milky | milky | insufficient | insufficient | spring | spring | groundwater | other | other | functional needs repair |
5 rows × 41 columns
| Variable | Beschreibung |
|---|---|
| amount_tsh | Total static head (amount water available to waterpoint) |
| date_recorded | The date the row was entered |
| funder | Who funded the well |
| gps_height | Altitude of the well |
| installer | Organization that installed the well |
| longitude | GPS coordinate |
| latitude | GPS coordinate |
| wpt_name | Name of the waterpoint if there is one |
| num_private | ? |
| basin | Geographic water basin |
| subvillage | Geographic location |
| region | Geographic location |
| region_code | Geographic location (coded) |
| district_code | Geographic location (coded) |
| lga | Geographic location |
| ward | Geographic location |
| population | Population around the well |
| public_meeting | True/False |
| recorded_by | Group entering this row of data |
| scheme_management | Who operates the waterpoint |
| scheme_name | Who operates the waterpoint |
| permit | If the waterpoint is permitted |
| construction_year | Year the waterpoint was constructed |
| extraction_type | The kind of extraction the waterpoint uses |
| extraction_type_group | The kind of extraction the waterpoint uses |
| extraction_type_class | The kind of extraction the waterpoint uses |
| management | How the waterpoint is managed |
| management_group | How the waterpoint is managed |
| payment | What the water costs |
| payment_type | What the water costs |
| water_quality | The quality of the water |
| quality_group | The quality of the water |
| quantity | The quantity of water |
| quantity_group | The quantity of water |
| source | The source of the water |
| source_type | The source of the water |
| source_class | The source of the water |
| waterpoint_type | The kind of waterpoint |
| waterpoint_type_group | The kind of waterpoint |
| status_group | functional or non-functional or functional needs repair |
df_pump_train.shape
(50490, 41)
df_pump_train.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| id | 50490.0 | 37113.857695 | 21447.241039 | 0.000000 | 18517.250000 | 37027.500000 | 55636.750000 | 7.424700e+04 |
| amount_tsh | 50490.0 | 319.158123 | 2987.517185 | 0.000000 | 0.000000 | 0.000000 | 20.000000 | 3.500000e+05 |
| gps_height | 50490.0 | 669.714280 | 693.123330 | -90.000000 | 0.000000 | 371.000000 | 1322.000000 | 2.770000e+03 |
| longitude | 50490.0 | 34.078591 | 6.553382 | 0.000000 | 33.084409 | 34.906548 | 37.178899 | 4.034519e+01 |
| latitude | 50490.0 | -5.708336 | 2.943806 | -11.648378 | -8.541643 | -5.019807 | -3.327550 | -2.000000e-08 |
| num_private | 50490.0 | 0.459497 | 10.413455 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.402000e+03 |
| region_code | 50490.0 | 15.280511 | 17.550035 | 1.000000 | 5.000000 | 12.000000 | 17.000000 | 9.900000e+01 |
| district_code | 50490.0 | 5.641731 | 9.668596 | 0.000000 | 2.000000 | 3.000000 | 5.000000 | 8.000000e+01 |
| population | 50490.0 | 180.913389 | 479.854870 | 0.000000 | 0.000000 | 25.000000 | 215.750000 | 3.050000e+04 |
| construction_year | 50490.0 | 1301.724520 | 951.258506 | 0.000000 | 0.000000 | 1986.000000 | 2004.000000 | 2.013000e+03 |
df_pump_train.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50490 entries, 0 to 50489 Data columns (total 41 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 50490 non-null int64 1 amount_tsh 50490 non-null float64 2 date_recorded 50490 non-null object 3 funder 47396 non-null object 4 gps_height 50490 non-null int64 5 installer 47380 non-null object 6 longitude 50490 non-null float64 7 latitude 50490 non-null float64 8 wpt_name 50490 non-null object 9 num_private 50490 non-null int64 10 basin 50490 non-null object 11 subvillage 50175 non-null object 12 region 50490 non-null object 13 region_code 50490 non-null int64 14 district_code 50490 non-null int64 15 lga 50490 non-null object 16 ward 50490 non-null object 17 population 50490 non-null int64 18 public_meeting 47644 non-null object 19 recorded_by 50490 non-null object 20 scheme_management 47221 non-null object 21 scheme_name 26548 non-null object 22 permit 47899 non-null object 23 construction_year 50490 non-null int64 24 extraction_type 50490 non-null object 25 extraction_type_group 50490 non-null object 26 extraction_type_class 50490 non-null object 27 management 50490 non-null object 28 management_group 50490 non-null object 29 payment 50490 non-null object 30 payment_type 50490 non-null object 31 water_quality 50490 non-null object 32 quality_group 50490 non-null object 33 quantity 50490 non-null object 34 quantity_group 50490 non-null object 35 source 50490 non-null object 36 source_type 50490 non-null object 37 source_class 50490 non-null object 38 waterpoint_type 50490 non-null object 39 waterpoint_type_group 50490 non-null object 40 status_group 50490 non-null object dtypes: float64(3), int64(7), object(31) memory usage: 15.8+ MB
print("Unique Values:")
nuniques = pd.DataFrame({"Feature":[], "nuniques":[]})
for col in df_pump_train.columns:
if df_pump_train[col].dtype == object:
nuniques.loc[len(nuniques)] = [col, df_pump_train[col].nunique()]
nuniques.sort_values("nuniques", ascending=False)
Unique Values:
| Feature | nuniques | |
|---|---|---|
| 3 | wpt_name | 32390.0 |
| 5 | subvillage | 17774.0 |
| 12 | scheme_name | 2583.0 |
| 8 | ward | 2085.0 |
| 2 | installer | 1964.0 |
| 1 | funder | 1734.0 |
| 0 | date_recorded | 348.0 |
| 7 | lga | 125.0 |
| 6 | region | 21.0 |
| 14 | extraction_type | 18.0 |
| 15 | extraction_type_group | 13.0 |
| 11 | scheme_management | 12.0 |
| 17 | management | 12.0 |
| 25 | source | 10.0 |
| 4 | basin | 9.0 |
| 21 | water_quality | 8.0 |
| 16 | extraction_type_class | 7.0 |
| 19 | payment | 7.0 |
| 20 | payment_type | 7.0 |
| 26 | source_type | 7.0 |
| 28 | waterpoint_type | 7.0 |
| 22 | quality_group | 6.0 |
| 29 | waterpoint_type_group | 6.0 |
| 18 | management_group | 5.0 |
| 23 | quantity | 5.0 |
| 24 | quantity_group | 5.0 |
| 27 | source_class | 3.0 |
| 30 | status_group | 3.0 |
| 9 | public_meeting | 2.0 |
| 13 | permit | 2.0 |
| 10 | recorded_by | 1.0 |
df_pump_train["t_functional"] = df_pump_train["status_group"].apply(lambda x: 0 if x=="non functional" else 1)
df_pump_train["t_needs_rep"] = df_pump_train["status_group"].apply(lambda x: 1 if x=="functional needs repair" else 0)
targets = ["status_group", "t_functional", "t_needs_rep"]
plt.figure(figsize=(12,4))
i=0
for t in targets:
i+=1
plt.subplot(1, 3, i)
df_pump_train[t].value_counts(normalize=True).plot(kind='bar')
plt.title(t)
plt.tight_layout()
df_pump_train["id"] = df_pump_train["id"].astype("category")
df_pump_train["region_code"] = df_pump_train["region_code"].astype("category")
df_pump_train["district_code"] = df_pump_train["district_code"].astype("category")
sns.pairplot(pd.concat((df_pump_train.select_dtypes(exclude="object"), df_pump_train["status_group"]), axis=1), hue="status_group")
<seaborn.axisgrid.PairGrid at 0x19e0036c408>
plt.figure(figsize=(8,6))
sns.heatmap(df_pump_train.corr(), annot=True, mask=np.triu(df_pump_train.corr()), vmin=-1, vmax=1, linewidths=0.25);
num_features = list(filter(None, [col if df_pump_train[col].dtype == "int64" or df_pump_train[col].dtype == "float64" else None for col in df_pump_train.columns]))
num_features = list(set(num_features)-set(targets))
i=1
plt.figure(figsize=(12,18))
for f in num_features:
plt.subplot(len(num_features), 3, i)
sns.boxplot(x="t_functional", y=f, data=df_pump_train)
plt.subplot(len(num_features), 3, i+1)
sns.boxplot(x="t_needs_rep", y=f, data=df_pump_train)
plt.subplot(len(num_features), 3, i+2)
sns.boxplot(x="status_group", y=f, data=df_pump_train)
i+=3
plt.tight_layout()
construction_year ohne 0-Werte:
plt.figure(figsize=(12,4.5))
plt.subplot(1, 3, 1)
sns.boxplot(x="t_functional", y="construction_year", data=df_pump_train.loc[df_pump_train["construction_year"]!=0])
plt.subplot(1, 3, 2)
sns.boxplot(x="t_needs_rep", y="construction_year", data=df_pump_train.loc[df_pump_train["construction_year"]!=0])
plt.subplot(1, 3, 3)
sns.boxplot(x="status_group", y="construction_year", data=df_pump_train.loc[df_pump_train["construction_year"]!=0])
plt.xticks(rotation=90)
plt.tight_layout()
qualitative_features = list(filter(None, [col if df_pump_train[col].dtype==object or df_pump_train[col].dtype==bool or df_pump_train[col].dtype=='category' else None for col in df_pump_train.columns]))
qualitative_features = list(set(qualitative_features)-set(targets))
i=1
fig = plt.figure(figsize=(12,64))
for col in qualitative_features:
if df_pump_train[col].nunique() < 10 and df_pump_train[col].nunique()>1:
for target in targets:
ax = fig.add_subplot(15, 3, i)
ct = pd.crosstab(index=df_pump_train[col],columns=df_pump_train[target])
ct.plot.bar(ax=ax, rot=0)
plt.xticks(rotation=90)
plt.legend(fontsize=8)
i+=1
plt.tight_layout()
df_pump_train[["payment", "payment_type"]].drop_duplicates()
| payment | payment_type | |
|---|---|---|
| 0 | pay per bucket | per bucket |
| 1 | unknown | unknown |
| 2 | never pay | never pay |
| 3 | pay monthly | monthly |
| 19 | other | other |
| 20 | pay annually | annually |
| 43 | pay when scheme fails | on failure |
import geopandas as gpd
fig, ax = plt.subplots(figsize=(8,6))
countries = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
countries[countries["continent"] == "Africa"].plot(color="lightgrey", ax=ax)
df_pump_train.plot(x="longitude", y="latitude", kind="scatter", c="tab:blue", colormap="Blues", title="Geographic Distribution", ax=ax)
<AxesSubplot:title={'center':'Geographic Distribution'}, xlabel='longitude', ylabel='latitude'>
import plotly.express as px
fig = px.scatter_mapbox(df_pump_train, lat="latitude", lon="longitude", color="status_group", zoom=5, opacity=0.25, hover_name="id")
fig.update_layout(
mapbox_style="white-bg",
mapbox_layers=[
{
"below": 'traces',
"sourcetype": "raster",
"sourceattribution": "United States Geological Survey",
"source": [
"https://basemap.nationalmap.gov/arcgis/rest/services/USGSImageryOnly/MapServer/tile/{z}/{y}/{x}"
]
}
])
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
# interesting = ["gps_height", "basin", "region"]
fig = px.scatter_mapbox(df_pump_train, lat="latitude", lon="longitude", color="region", zoom=5, opacity=0.25, hover_name="id")
fig.update_layout(
mapbox_style="white-bg",
mapbox_layers=[
{
"below": 'traces',
"sourcetype": "raster",
"sourceattribution": "United States Geological Survey",
"source": [
"https://basemap.nationalmap.gov/arcgis/rest/services/USGSImageryOnly/MapServer/tile/{z}/{y}/{x}"
]
}
])
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
fig = px.scatter_mapbox(df_pump_train, lat="latitude", lon="longitude", color="construction_year", zoom=5, opacity=0.25, hover_name="id")
fig.update_layout(
mapbox_style="white-bg",
mapbox_layers=[
{
"below": 'traces',
"sourcetype": "raster",
"sourceattribution": "United States Geological Survey",
"source": [
"https://basemap.nationalmap.gov/arcgis/rest/services/USGSImageryOnly/MapServer/tile/{z}/{y}/{x}"
]
}
])
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
Die Vorbereitung der Daten wird in einer Pipeline durchgeführt, die schnell und einfach angepasst sowie problemlos auf die Testddaten angewandt werden kann.
Zielsetzung:
#Prüft ob Name vorhanden, dann 1, sonst 0
class YesNoEncoder(BaseEstimator, TransformerMixin):
def __init__(self, add_column = True):
self.add_column = add_column
def fit(self, X, y=None):
return self
def transform(self, X, y=None):
if self.add_column:
has_name = np.where(X['wpt_name']!= 'none', 1, 0)
Xcopy = X
Xcopy['wpt_name'] = has_name
return Xcopy
#Füllt Nullwerte auf, erkennt auch Werte, die vom normalen Imputer nicht erkannt werden: if str(x)==str(np.nan)
class MyImputer(BaseEstimator, TransformerMixin):
def __init__(self, fill_value="other"):
self.fill_value = fill_value
return None
def fit(self, X, y=None):
return self
def transform(self, X, y=None):
return np.array([self.fill_value if str(x)==str(np.nan) else x for x in X]).reshape(-1,1)
scheme_pipeline = Pipeline([
('imputer', MyImputer(fill_value="other")),
('one_hot', OneHotEncoder(handle_unknown='ignore'))
])
scheme_pipeline_lbl = Pipeline([
('imputer', MyImputer(fill_value="other")),
('ord_enc', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan)),
('imp2', SimpleImputer(fill_value=0.5)),
('min_max', MinMaxScaler())
])
#Reduziert Datumswerte auf reine Jahresinformation
class YearAdder(BaseEstimator, TransformerMixin):
def __init__(self):
return None
def fit(self, X, y=None):
return self
def transform(self, X, y=None):
date_column = pd.Series(X)
return np.array(pd.DatetimeIndex(date_column).year).reshape(-1,1)
year_pipeline = Pipeline([
("add_year", YearAdder()),
("scale", MinMaxScaler())
])
age_pipeline = Pipeline([
("imp_age", SimpleImputer(strategy="median")),
("scale", MinMaxScaler())
])
col_imp = Pipeline([
('simp_inputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
('target_enc', TargetEncoder()),
('min_max', MinMaxScaler())
])
target_enc_pipe = Pipeline([
('target_enc', TargetEncoder()),
('min_max', MinMaxScaler())
])
ord_enc_pipe = Pipeline([
('ord_enc', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan)),
('imp2', SimpleImputer(fill_value=0.5)),
('min_max', MinMaxScaler())
])
df_train = df_pump_train.copy()
#Erstellt neue Zielvariable für die unterschiedlichen Fragestellungen
df_train["t_functional"] = df_train["status_group"].apply(lambda x: 0 if x=="non functional" else 1)
df_train["t_needs_rep"] = df_train["status_group"].apply(lambda x: 1 if x=="functional needs repair" else 0)
#Ausreißer entfernen
df_train = df_train.drop(df_train[df_train.longitude == 0].index)
#Manuelles Featuere Engineering
df_train["amount_per_capita"] = df_train.apply(lambda x: x["amount_tsh"]/x["population"] if x["population"] != 0 else 0, axis=1)
df_train["age"]=df_train['date_recorded'].apply(lambda x: datetime.datetime.strptime(x,'%Y-%m-%d'))-df_train["construction_year"].apply(lambda x: datetime.datetime.strptime(str(x),'%Y') if x!=0 else datetime.datetime.now())
df_train["age"].loc[df_train["construction_year"]!=0]=df_train["age"].loc[df_train["construction_year"]!=0].apply(lambda x: x.days)
df_train["age"].loc[df_train["construction_year"]==0]=np.nan
#Datentypen manuell transformieren
df_train["district_code"] = df_train["district_code"].astype("category")
df_train["region_code"] = df_train["region_code"].astype("category")
#label / target Enconding
y_train = df_train.loc[:,'status_group']
le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_train_f = df_train["t_functional"]
y_train_r = df_train["t_needs_rep"]
Verschiedene Transformer, die später getestet werden.
Alle Spalten mit target, binary oder one-hot encoding transformiert, Duplikate werden gedroppt:
col_trans = ColumnTransformer(transformers=[
['col_imp', col_imp, ['funder', 'installer', 'subvillage']],
['target_enc', target_enc_pipe, ['lga', 'ward', 'management']],
['oh_enc', OneHotEncoder(), ['basin', 'payment','water_quality', 'quantity','source','waterpoint_type', 'extraction_type_class']],
['bin_enc', BinaryEncoder(), ['region', 'region_code', 'district_code']],
['cust_trans', YesNoEncoder(), ['wpt_name']],
["num", MinMaxScaler(), ["amount_tsh", "population", "amount_per_capita", 'gps_height', 'longitude', 'latitude']],
["scheme", scheme_pipeline, "scheme_management"],
["year", year_pipeline, "date_recorded"]
], remainder='drop')
Transformation mit Ordinal Encoder:
col_trans_lbl = ColumnTransformer(transformers=[
['col_imp', col_imp, ['funder', 'installer', 'subvillage']],
['target_enc', target_enc_pipe, ['lga', 'ward', 'management']],
['ord_enc', ord_enc_pipe, ['basin', 'payment','water_quality', 'quantity','source','waterpoint_type', 'extraction_type_class', 'region', 'region_code', 'district_code']],
['cust_trans', YesNoEncoder(), ['wpt_name']],
["num", MinMaxScaler(), ["amount_tsh", "population", "amount_per_capita", 'gps_height', 'longitude', 'latitude']],
["scheme", scheme_pipeline_lbl, "scheme_management"],
["year", year_pipeline, "date_recorded"]
], remainder='drop')
Fügt neues Merkmal 'Pumpenalter' hinzu:
col_trans_w_age = ColumnTransformer(transformers=[
['col_imp', col_imp, ['funder', 'installer', 'subvillage']],
['target_enc', target_enc_pipe, ['lga', 'ward', 'management']],
['oh_enc', OneHotEncoder(), ['basin', 'payment','water_quality', 'quantity','source','waterpoint_type', 'extraction_type_class']],
['bin_enc', BinaryEncoder(), ['region', 'region_code', 'district_code']], # warum alle 3 ?
['cust_trans', YesNoEncoder(), ['wpt_name']],
["num", MinMaxScaler(), ["amount_tsh", "population", "amount_per_capita", 'gps_height', 'longitude', 'latitude']],
["scheme", scheme_pipeline, "scheme_management"],
["year", year_pipeline, "date_recorded"],
["age", age_pipeline, ["age"]]
], remainder='drop')
Kombination aus Alter und Ordinal Encoding
col_trans_lbl_w_age = ColumnTransformer(transformers=[
['col_imp', col_imp, ['funder', 'installer', 'subvillage']],
['target_enc', target_enc_pipe, ['lga', 'ward', 'management']],
['ord_enc', ord_enc_pipe, ['basin', 'payment','water_quality', 'quantity','source','waterpoint_type', 'extraction_type_class', 'region', 'region_code', 'district_code']],
['cust_trans', YesNoEncoder(), ['wpt_name']],
["num", MinMaxScaler(), ["amount_tsh", "population", "amount_per_capita", 'gps_height', 'longitude', 'latitude']],
["scheme", scheme_pipeline_lbl, "scheme_management"],
["year", year_pipeline, "date_recorded"],
["age", age_pipeline, ["age"]]
], remainder='drop')
Erstellung eines reduzierten Featuressets (16 Merkmale) mit Ordinal Encoding:
col_trans_lbl_reduced = ColumnTransformer(transformers=[
['col_imp', col_imp, ['funder', 'installer', 'subvillage']],
['target_enc', target_enc_pipe, ['lga', 'ward']],
['ord_enc', ord_enc_pipe, ['payment','water_quality', 'quantity','source','waterpoint_type', 'extraction_type_class', 'region', 'region_code']],
['cust_trans', YesNoEncoder(), ['wpt_name']],
["num", MinMaxScaler(), ["amount_tsh"]],
["year", year_pipeline, "date_recorded"]
], remainder='drop')
Fitting der Transformer auf die Trainingsdaten und Anwendung auf die selbigen:
X_train = col_trans.fit_transform(X=df_train, y=y_train)
X_train_lbl = col_trans_lbl.fit_transform(X=df_train, y=y_train)
X_train_w_age = col_trans_w_age.fit_transform(X=df_train, y=y_train)
X_train_lbl_w_age = col_trans_lbl_w_age.fit_transform(X=df_train, y=y_train)
X_train_lbl_reduced = col_trans_lbl_reduced.fit_transform(X=df_train, y=y_train)
Für die unterschiedlichen Ausprägungen der Target Variable status_group besteht, wie in der Data Analysis bereits festgestellt, ein Ungleichgewicht. Datenpunkte für die Gruppe 'functional needs repair' machen nur einen kleinen Teil der Gesamtheit aus. Dieses Ungeleichgewicht kann das Training unserer Modelle beeinflussen, da sich diese beim Training hauptsächlich auf die 'Majority Class' konzentrieren.
Möglichkeiten mit Ungleichgewicht umzugehen:
Wir entscheiden uns für Undersampling, da trotz der Einschränkung der Größe des Datensatzes, noch viele Datenpunkte zur Verfügung stehen.
from imblearn.under_sampling import RandomUnderSampler
#Trennung nach Reparaturstatus
df_train_1 = df_train[df_train["t_needs_rep"] == 1]
df_train_0 = df_train[df_train["t_needs_rep"] == 0]
num_r = len(df_train_1)
print(f"{num_r} pumps need repairs, therefore we need the same amount of pumps that don't need repairs")
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(df_train, y_train_r)
print(X_resampled["t_needs_rep"].value_counts())
3356 pumps need repairs, therefore we need the same amount of pumps that don't need repairs 0 3356 1 3356 Name: t_needs_rep, dtype: int64
# neue Instanzen der Transformer, teils mit setzen von handle_unknown
col_trans_u = ColumnTransformer(transformers=[
['col_imp', col_imp, ['funder', 'installer', 'subvillage']],
['target_enc', target_enc_pipe, ['lga', 'ward', 'management']],
['oh_enc', OneHotEncoder(handle_unknown='ignore'), ['basin', 'payment','water_quality', 'quantity','source','waterpoint_type', 'extraction_type_class']],
['bin_enc', BinaryEncoder(handle_unknown='ignore'), ['region', 'region_code', 'district_code']], # warum alle 3 ?
['cust_trans', YesNoEncoder(), ['wpt_name']],
["num", MinMaxScaler(), ["amount_tsh", "population", "amount_per_capita", 'gps_height', 'longitude', 'latitude']],
["scheme", scheme_pipeline, "scheme_management"],
["year", year_pipeline, "date_recorded"]
], remainder='drop')
col_trans_lbl_w_age_u = ColumnTransformer(transformers=[
['col_imp', col_imp, ['funder', 'installer', 'subvillage']],
['target_enc', target_enc_pipe, ['lga', 'ward', 'management']],
['ord_enc', ord_enc_pipe, ['basin', 'payment','water_quality', 'quantity','source','waterpoint_type', 'extraction_type_class', 'region', 'region_code', 'district_code']],
['cust_trans', YesNoEncoder(), ['wpt_name']],
["num", MinMaxScaler(), ["amount_tsh", "population", "amount_per_capita", 'gps_height', 'longitude', 'latitude']],
["scheme", scheme_pipeline_lbl, "scheme_management"],
["year", year_pipeline, "date_recorded"],
["age", age_pipeline, ["age"]]
], remainder='drop')
col_trans_lbl_reduced_u = ColumnTransformer(transformers=[
['col_imp', col_imp, ['funder', 'installer', 'subvillage']],
['target_enc', target_enc_pipe, ['lga', 'ward']],
['ord_enc', ord_enc_pipe, ['payment','water_quality', 'quantity','source','waterpoint_type', 'extraction_type_class', 'region', 'region_code']],
['cust_trans', YesNoEncoder(), ['wpt_name']],
["num", MinMaxScaler(), ["amount_tsh"]],
["year", year_pipeline, "date_recorded"]
], remainder='drop')
Anwendung der ColumnTransformer auf unsere Trainingsdaten:
X_train_u = col_trans_u.fit_transform(X=X_resampled, y=y_resampled)
X_train_u_lbl = col_trans_lbl_w_age_u.fit_transform(X=X_resampled, y=y_resampled)
X_train_u_lbl_reduced = col_trans_lbl_reduced_u.fit_transform(X=X_resampled, y=y_resampled)
y_train_u_r = y_resampled
Prüfung auf Dimensionen nach der Transformation:
print(X_train.shape)
print(X_train_lbl.shape)
print(X_train_u.shape)
print(X_train_u_lbl.shape)
print(X_train_w_age.shape)
print(X_train_lbl_w_age.shape)
print(X_train_lbl_reduced.shape)
(48958, 95) (48958, 25) (6712, 92) (6712, 26) (48958, 96) (48958, 26) (48958, 16)
print(f'({X_train.min()}, {X_train.max()})')
print(f'({X_train_lbl.min()}, {X_train_lbl.max()})')
print(f'({X_train_u.min()}, {X_train.max()})')
print(f'({X_train_u_lbl.min()}, {X_train_lbl.max()})')
print(f'({X_train_w_age.min()}, {X_train_w_age.max()})')
print(f'({X_train_lbl_w_age.min()}, {X_train_lbl_w_age.max()})')
print(f'({X_train_lbl_reduced.min()}, {X_train_lbl_reduced.max()})')
(0.0, 1.0) (0.0, 1.0) (0.0, 1.0) (0.0, 1.0) (0.0, 1.0) (0.0, 1.0) (0.0, 1.0)
Training von Prognosemodelle für die 3 Aufgaben
df_results = pd.DataFrame(data=[], columns=["Model", "AUC Train", "AUC CV","Precsion","Recall"])
lr_cf = LogisticRegression(solver="sag", max_iter=1000, n_jobs= -1, class_weight='balanced')
lr_cf.fit(X_train, y_train_f)
y_pred_f = lr_cf.predict_log_proba(X_train)[:,1]
auc_train_f = roc_auc_score(y_train_f, y_pred_f)
cv_res = cross_validate(estimator=lr_cf, X=X_train, y=y_train_f, cv=10, scoring='roc_auc_ovr', n_jobs= -1)
auc_cv = cv_res["test_score"].mean()
y_pred_f_bin = lr_cf.predict(X_train)
precision = precision_score(y_train_f, y_pred_f_bin)
recall = recall_score(y_train_f, y_pred_f_bin)
df_results.loc[len(df_results)] = [lr_cf, auc_train_f, auc_cv, precision, recall]
dt_cf = DecisionTreeClassifier()
dt_cf.fit(X_train, y_train_f)
y_pred_f = dt_cf.predict_proba(X_train)[:,1]
auc_train_f = roc_auc_score(y_train_f, y_pred_f)
cv_res = cross_validate(estimator=dt_cf, X=X_train, y=y_train_f, cv=10, scoring='roc_auc_ovr', n_jobs= -1)
y_pred_f_bin = dt_cf.predict(X_train)
precision = precision_score(y_train_f, y_pred_f_bin)
recall = recall_score(y_train_f, y_pred_f_bin)
df_results.loc[len(df_results)] = [dt_cf, auc_train_f, auc_cv, precision, recall]
rf_cf = RandomForestClassifier(random_state=42, n_jobs= -1)
rf_cf.fit(X_train, y_train_f)
y_pred_f = rf_cf.predict_proba(X_train)[:,1]
auc_train_f = roc_auc_score(y_train_f, y_pred_f)
cv_res = cross_validate(estimator=rf_cf, X=X_train, y=y_train_f, cv=10, scoring='roc_auc_ovr', n_jobs= -1)
auc_cv = cv_res["test_score"].mean()
y_pred_f_bin = rf_cf.predict(X_train)
precision = precision_score(y_train_f, y_pred_f_bin)
recall = recall_score(y_train_f, y_pred_f_bin)
df_results.loc[len(df_results)] = ['Random Forest', auc_train_f, auc_cv, precision, recall]
xgb_cf = XGBClassifier(eval_metric = 'auc', n_jobs= -1)
xgb_cf.fit(X_train, y_train_f)
y_pred_f = xgb_cf.predict_proba(X_train)[:,1]
auc_train = roc_auc_score(y_train_f, y_pred_f, multi_class='ovr')
cv_res = cross_validate(estimator=xgb_cf, X=X_train, y=y_train_f, cv=10, scoring='roc_auc_ovr', n_jobs=-1)
auc_cv = cv_res["test_score"].mean()
y_pred_f_bin = xgb_cf.predict(X_train)
precision = precision_score(y_train_f, y_pred_f_bin)
recall = recall_score(y_train_f, y_pred_f_bin)
df_results.loc[len(df_results)] = ['xgBoost', auc_train, auc_cv, precision, recall]
knn_cf = KNeighborsClassifier(n_jobs= -1)
knn_cf.fit(X_train, y_train_f)
y_pred_F = knn_cf.predict_proba(X_train)
auc_train = roc_auc_score(y_train_f, y_pred_f)
cv_res = cross_validate(estimator=knn_cf, X=X_train, y=y_train_f, cv=5, scoring='roc_auc_ovr', n_jobs= -1)
auc_cv = cv_res["test_score"].mean()
y_pred_f_bin = knn_cf.predict(X_train)
precision = precision_score(y_train_f, y_pred_f_bin)
recall = recall_score(y_train_f, y_pred_f_bin)
df_results.loc[len(df_results)] = [knn_cf, auc_train, auc_cv, precision, recall]
svm_cf = SVC(probability=True,cache_size= 3000)
svm_cf.fit(X_train, y_train_f)
y_pred_f = svm_cf.predict_proba(X_train)[:,1]
print(y_pred_f)
auc_train = roc_auc_score(y_train_f, y_pred_f)
cv_res = cross_validate(estimator=svm_cf, X=X_train, y=y_train_f, cv=5, scoring='roc_auc_ovr', n_jobs=-1)
auc_cv = cv_res["test_score"].mean()
y_pred_f_bin = svm_cf.predict(X_train)
precision = precision_score(y_train_f, y_pred_f_bin)
recall = recall_score(y_train_f, y_pred_f_bin)
df_results.loc[len(df_results)] = [svm_cf, auc_train, auc_cv, precision, recall]
[0.98465951 0.44259414 0.07015871 ... 0.9767568 0.10482034 0.16657297]
display(df_results)
| Model | AUC Train | AUC CV | Precsion | Recall | |
|---|---|---|---|---|---|
| 0 | LogisticRegression(class_weight='balanced', ma... | 0.917535 | 0.916595 | 0.876565 | 0.866474 |
| 1 | DecisionTreeClassifier() | 1.000000 | 0.916595 | 1.000000 | 1.000000 |
| 2 | Random Forest | 1.000000 | 0.932227 | 0.999933 | 1.000000 |
| 3 | xgBoost | 0.967729 | 0.943200 | 0.893030 | 0.956678 |
| 4 | KNeighborsClassifier(n_jobs=-1) | 0.967729 | 0.897929 | 0.893092 | 0.937280 |
| 5 | SVC(cache_size=3000, probability=True) | 0.940938 | 0.932008 | 0.865777 | 0.943701 |
df_tuned_f = pd.DataFrame(data=[], columns=["Model", "AUC Train", "AUC CV", "Precision", "Recall"])
param_grid = [{
"n_estimators":[570,590, 600, 700],
"max_depth": [17, 18, 20, 25],
"bootstrap": [True, False]
}]
rf_cf = RandomForestClassifier(random_state=42, n_jobs=-1)
rs = RandomizedSearchCV(rf_cf, param_distributions=param_grid, scoring="roc_auc_ovr", cv=10, n_jobs=-1)
rs.fit(X_train, y_train_f)
best_rf = rs.best_params_
best_rf
{'n_estimators': 700, 'max_depth': 18, 'bootstrap': True}
param_grid = [{
"n_estimators":[590,595],
"max_depth": [16, 17],
"bootstrap": [False]
}]
rf_cf = RandomForestClassifier(random_state=42, n_jobs=-1)
rs = GridSearchCV(rf_cf, param_grid=param_grid, scoring="roc_auc_ovr", cv=10, n_jobs=-1)
rs.fit(X_train, y_train_f)
best_rf = rs.best_params_
print(best_rf)
{'bootstrap': False, 'max_depth': 17, 'n_estimators': 590}
rf_cf_t = RandomForestClassifier(random_state=42,
bootstrap = best_rf.get('bootstrap'),
max_depth = best_rf.get('max_depth'),
n_estimators =best_rf.get('n_estimators'),
n_jobs=-1
)
rf_cf_t.fit(X_train, y_train_f)
y_pred_f = rf_cf_t.predict_proba(X_train)[:,1]
auc_train_f = roc_auc_score(y_train_f, y_pred_f)
cv_res = cross_validate(estimator=rf_cf_t, X=X_train, y=y_train_f, cv=10, scoring='roc_auc_ovr', n_jobs=-1)
auc_cv = cv_res["test_score"].mean()
y_pred_f_bin = rf_cf_t.predict(X_train)
precision = precision_score(y_train_f, y_pred_f_bin)
recall = recall_score(y_train_f, y_pred_f_bin)
df_tuned_f.loc[len(df_tuned_f)] = ["Random Forest Tuned", auc_train_f, auc_cv, precision, recall]
param_grid = [{
"criterion":["gini", "entropy"],
"max_depth": [8, 9, 10, 11, 12],
"min_samples_split": [3,5,6,7, 10],
"min_samples_leaf": [1,2,3,5]
}]
dt_cf = DecisionTreeClassifier(random_state=42)
rs = RandomizedSearchCV(dt_cf, param_distributions=param_grid, scoring="roc_auc_ovr", cv=10, n_iter = 10, n_jobs=-1)
rs.fit(X_train, y_train_f)
best_dt = rs.best_params_
best_dt
{'min_samples_split': 6,
'min_samples_leaf': 5,
'max_depth': 10,
'criterion': 'gini'}
param_grid = [{
"criterion":["gini", "entropy"],
"max_depth": [10, 11, 12],
"min_samples_split": [6,7,8],
"min_samples_leaf": [2,3,4]
}]
dt_cf = DecisionTreeClassifier(random_state=42)
rs = GridSearchCV(dt_cf, param_grid=param_grid, scoring="roc_auc_ovr", cv=10, n_jobs=-1)
rs.fit(X_train, y_train_f)
best_dt = rs.best_params_
print(best_dt)
{'criterion': 'entropy', 'max_depth': 11, 'min_samples_leaf': 4, 'min_samples_split': 6}
dt_cf_t = DecisionTreeClassifier(
random_state=42,
criterion = best_dt.get('criterion'),
max_depth = best_dt.get('max_depth'),
min_samples_split=best_dt.get("min_samples_split"),
min_samples_leaf=best_dt.get("min_samples_leaf")
)
dt_cf_t.fit(X_train, y_train_f)
y_pred_f = dt_cf_t.predict_proba(X_train)[:,1]
auc_train_f = roc_auc_score(y_train_f, y_pred_f)
cv_res = cross_validate(estimator=dt_cf_t, X=X_train, y=y_train_f, cv=10, scoring='roc_auc_ovr', n_jobs=-1)
auc_cv = cv_res["test_score"].mean()
y_pred_f_bin = dt_cf_t.predict(X_train)
precision = precision_score(y_train_f, y_pred_f_bin)
recall = recall_score(y_train_f, y_pred_f_bin)
df_tuned_f.loc[len(df_tuned_f)] = ["Decision Tree Tuned", auc_train_f, auc_cv, precision, recall]
param_grid = [{
"eta":[0.1, 0.15, 0.2, 0.5],
"max_depth": [9,10,12,15],
"min_child_weight": [0.25, 0,5, 0.75, 1]
}]
xg_cf_t= XGBClassifier(random_state=42, n_jobs= -1, eval_metric = 'auc')
rs = RandomizedSearchCV(xg_cf_t, param_distributions=param_grid, scoring="roc_auc_ovr", cv=10, n_jobs= -1, n_iter = 10)
rs.fit(X_train, y_train_f)
best_xg = rs.best_params_
print(best_xg)
{'min_child_weight': 5, 'max_depth': 12, 'eta': 0.2}
param_grid = [{
"eta":[0.2, 0.215],
"max_depth": [9,10],
"min_child_weight": [1, 1.25]
}]
xg_cf =XGBClassifier(random_state=42)
rs = GridSearchCV(xg_cf, param_grid=param_grid, scoring="roc_auc_ovr", cv=10, n_jobs=-1)
rs.fit(X_train, y_train_f)
best_dt = rs.best_params_
print(best_dt)
{'eta': 0.2, 'max_depth': 9, 'min_child_weight': 1.25}
xgb_cf_t = XGBClassifier(nthread=-1,eta=best_xg.get('eta'), max_depth= best_xg.get('max_depth'), min_child_weight= best_xg.get('min_child_weight'), eval_metric = 'auc', n_jobs= -1)
xgb_cf_t.fit(X_train, y_train_f)
y_pred_f = xgb_cf_t.predict_proba(X_train)[:,1]
auc_train = roc_auc_score(y_train_f, y_pred_f, multi_class='ovr')
cv_res = cross_validate(estimator=xgb_cf_t, X=X_train, y=y_train_f, cv=10, scoring='roc_auc_ovr', n_jobs= -1)
auc_cv = cv_res["test_score"].mean()
y_pred_f_bin = xgb_cf_t.predict(X_train)
precision = precision_score(y_train_f, y_pred_f_bin)
recall = recall_score(y_train_f, y_pred_f_bin)
df_tuned_f.loc[len(df_tuned_f)] = [xgb_cf_t, auc_train, auc_cv, precision, recall]
display(df_results)
display(df_tuned_f)
| Model | AUC Train | AUC CV | Precsion | Recall | |
|---|---|---|---|---|---|
| 0 | LogisticRegression(class_weight='balanced', ma... | 0.917535 | 0.916595 | 0.876565 | 0.866474 |
| 1 | DecisionTreeClassifier() | 1.000000 | 0.916595 | 1.000000 | 1.000000 |
| 2 | Random Forest | 1.000000 | 0.932227 | 0.999933 | 1.000000 |
| 3 | xgBoost | 0.967729 | 0.943200 | 0.893030 | 0.956678 |
| 4 | KNeighborsClassifier(n_jobs=-1) | 0.967729 | 0.897929 | 0.893092 | 0.937280 |
| 5 | SVC(cache_size=3000, probability=True) | 0.940938 | 0.932008 | 0.865777 | 0.943701 |
| Model | AUC Train | AUC CV | Precision | Recall | |
|---|---|---|---|---|---|
| 0 | Random Forest Tuned | 0.992042 | 0.939364 | 0.926745 | 0.992580 |
| 1 | Decision Tree Tuned | 0.928126 | 0.910235 | 0.860137 | 0.909563 |
| 2 | XGBClassifier(base_score=0.5, booster='gbtree'... | 0.985583 | 0.943872 | 0.925581 | 0.971684 |
Für das binäre Klassifizierungsproblem kommen grundsätzlich unterschiedliche Modelle in Frage. Durch Training, Vergleich und Tuning soll das beste Vorgehen zur Prognose von status_group = functional needs repair bestimmt werden.
Im ersten Schritt erfolgt das Training unterschiedlicher Modelle mit der Standard-Konfiguration, um eine Vorauswahl zu treffen.
df_results_r = pd.DataFrame(data=[], columns=["Model", "AUC Train", "AUC CV", "ACC CV"])
Es soll die Area Under the Curve (AUC) als Qualitätsmaß genutzt werden. Die Accuray ist jedoch auch von Interesse.
scoring = {"AUC": "roc_auc", "Accuracy": make_scorer(accuracy_score)}
lr_cf = LogisticRegression(random_state=42)
lr_cf.fit(X_train_u, y_train_u_r)
y_pred = lr_cf.predict_proba(X_train_u)[:, 1]
auc_train = roc_auc_score(y_train_u_r, y_pred)
cv_res = cross_validate(estimator=lr_cf, X=X_train_u, y=y_train_u_r, cv=5, scoring=scoring)
auc_cv = cv_res["test_AUC"].mean()
acc_cv = cv_res["test_Accuracy"].mean()
df_results_r.loc[len(df_results_r)] = [lr_cf, auc_train, auc_cv, acc_cv]
svm_cf = SVC(probability=True, random_state=42)
svm_cf.fit(X_train_u, y_train_u_r)
y_pred = svm_cf.predict_proba(X_train_u)[:, 1]
auc_train = roc_auc_score(y_train_u_r, y_pred)
cv_res = cross_validate(estimator=svm_cf, X=X_train_u, y=y_train_u_r, cv=5, scoring=scoring)
auc_cv = cv_res["test_AUC"].mean()
acc_cv = cv_res["test_Accuracy"].mean()
df_results_r.loc[len(df_results_r)] = [svm_cf, auc_train, auc_cv, acc_cv]
knn_cf = KNeighborsClassifier()
knn_cf.fit(X_train_u, y_train_u_r)
y_pred = knn_cf.predict_proba(X_train_u)[:, 1]
auc_train = roc_auc_score(y_train_u_r, y_pred)
cv_res = cross_validate(estimator=knn_cf, X=X_train_u, y=y_train_u_r, cv=5, scoring=scoring)
auc_cv = cv_res["test_AUC"].mean()
acc_cv = cv_res["test_Accuracy"].mean()
df_results_r.loc[len(df_results_r)] = [knn_cf, auc_train, auc_cv, acc_cv]
dt_cf_ = DecisionTreeClassifier(random_state=42)
dt_cf_.fit(X_train_u, y_train_u_r)
y_pred = dt_cf_.predict_proba(X_train_u)[:, 1]
auc_train = roc_auc_score(y_train_u_r, y_pred)
cv_res = cross_validate(estimator=dt_cf_, X=X_train_u, y=y_train_u_r, cv=5, scoring=scoring)
auc_cv = cv_res["test_AUC"].mean()
acc_cv = cv_res["test_Accuracy"].mean()
df_results_r.loc[len(df_results_r)] = [dt_cf_, auc_train, auc_cv, acc_cv]
rf_cf = RandomForestClassifier(random_state=42)
rf_cf.fit(X_train_u, y_train_u_r)
y_pred = rf_cf.predict_proba(X_train_u)[:, 1]
auc_train = roc_auc_score(y_train_u_r, y_pred)
cv_res = cross_validate(estimator=rf_cf, X=X_train_u, y=y_train_u_r, cv=5, scoring=scoring)
auc_cv = cv_res["test_AUC"].mean()
acc_cv = cv_res["test_Accuracy"].mean()
df_results_r.loc[len(df_results_r)] = [str(rf_cf), auc_train, auc_cv, acc_cv]
xgb_cf = XGBClassifier()
xgb_cf.fit(X_train_u, y_train_u_r)
y_pred = xgb_cf.predict_proba(X_train_u)[:, 1]
auc_train = roc_auc_score(y_train_u_r, y_pred)
cv_res = cross_validate(estimator=xgb_cf, X=X_train_u, y=y_train_u_r, cv=5, scoring=scoring)
auc_cv = cv_res["test_AUC"].mean()
acc_cv = cv_res["test_Accuracy"].mean()
df_results_r.loc[len(df_results_r)] = [xgb_cf, auc_train, auc_cv, acc_cv]
Mit 5-facher Kreuzvalidierung wurden Scores zur Bewertung der unterschiedlichen Modelle berechnet.
Fazit Modellauswahl: Potenzielle Verbesserungen bei Ensemble-Modellen, Logistischer Regression sowie SVM durch Tuning erforschen. Die Ansätze Entscheidungsbaum und KNN werden nicht weiter verfolgt.
display(df_results_r)
| Model | AUC Train | AUC CV | ACC CV | |
|---|---|---|---|---|
| 0 | LogisticRegression(random_state=42) | 0.923421 | 0.917910 | 0.824793 |
| 1 | SVC(probability=True, random_state=42) | 0.937588 | 0.915691 | 0.826281 |
| 2 | KNeighborsClassifier() | 0.933883 | 0.843600 | 0.770112 |
| 3 | DecisionTreeClassifier(random_state=42) | 1.000000 | 0.771158 | 0.771158 |
| 4 | RandomForestClassifier(random_state=42) | 1.000000 | 0.894417 | 0.810490 |
| 5 | XGBClassifier(base_score=0.5, booster='gbtree'... | 0.990137 | 0.914013 | 0.812576 |
Gewählte Modelle:
df_tuned_r = pd.DataFrame(data=[], columns=["Model", "AUC Train", "AUC CV", "ACC CV"])
Initiales Tuning mit RandomSearch, um den GridSearch Bereich einzuschränken.
model_params = {
'n_estimators' : np.linspace(100, 1000, 5, dtype=int),
'max_depth': np.linspace(5, 20, 4, dtype=int),
'min_samples_split' : np.linspace(2, 30, 5, dtype=int),
'min_samples_leaf' : np.linspace(1, 5, 3, dtype=int),
'bootstrap' : [True, False]
}
rf_cf_tuned = RandomForestClassifier(random_state=42)
rs = RandomizedSearchCV(rf_cf_tuned, param_distributions=model_params, scoring="roc_auc", cv=5, n_jobs=-1)
rs.fit(X_train_u, y_train_u_r)
rs.best_params_
{'n_estimators': 550,
'min_samples_split': 30,
'min_samples_leaf': 5,
'max_depth': 20,
'bootstrap': False}
GridSearch im Bereich der mit RandomSearch bestimmten optimalen Parameter.
param_grid = [{
'n_estimators' : [250, 500, 750],
'max_depth': [10, 15, 20],
'min_samples_split' : [20, 25, 30],
'min_samples_leaf' : [1, 2],
'bootstrap' : [False]
}]
rf_cf_tuned = RandomForestClassifier(random_state=42)
gs = GridSearchCV(rf_cf_tuned, param_grid=param_grid, scoring="roc_auc", cv=5, n_jobs=-1)
gs.fit(X_train_u, y_train_u_r)
gs.best_params_
{'bootstrap': False,
'max_depth': 15,
'min_samples_leaf': 1,
'min_samples_split': 30,
'n_estimators': 750}
Training eines Random Forests basierend auf den gefundenen optimalen Parametern.
rf_cf_tuned = gs.best_estimator_
rf_cf_tuned.fit(X_train_u, y_train_u_r)
y_pred = rf_cf_tuned.predict_proba(X_train_u)[:, 1]
auc_train = roc_auc_score(y_train_u_r, y_pred)
cv_res = cross_validate(estimator=rf_cf_tuned, X=X_train_u, y=y_train_u_r, cv=5, scoring=scoring)
auc_cv = cv_res["test_AUC"].mean()
acc_cv = cv_res["test_Accuracy"].mean()
df_tuned_r.loc[len(df_tuned_r)] = [str(rf_cf_tuned), auc_train, auc_cv, acc_cv]
Analoges Vorgehen für das Tuning der anderen Klassifizierer.
model_params = {
'penalty' : ['l1', 'l2'],
'C': np.linspace(0.1, 2, 5, dtype=float),
'solver' : ['liblinear', 'saga', 'lbfgs']
}
lr_cf_tuned = LogisticRegression(random_state=42)
rs = RandomizedSearchCV(lr_cf_tuned, param_distributions=model_params, scoring="roc_auc", cv=5, n_jobs=-1)
rs.fit(X_train_u, y_train_u_r)
rs.best_params_
{'solver': 'liblinear', 'penalty': 'l1', 'C': 1.05}
param_grid = [{
'penalty' : ['l1', 'l2'],
'C': [0.5, 0.6],
'solver': ['lbfgs', 'saga']
}]
lr_cf_tuned = LogisticRegression(random_state=42)
gs = GridSearchCV(lr_cf_tuned, param_grid=param_grid, scoring="roc_auc", cv=5, verbose=1, n_jobs=-1)
gs.fit(X_train_u, y_train_u_r)
gs.best_params_
Fitting 5 folds for each of 8 candidates, totalling 40 fits
{'C': 0.5, 'penalty': 'l1', 'solver': 'saga'}
lr_cf_tuned = gs.best_estimator_
lr_cf_tuned.fit(X_train_u, y_train_u_r)
y_pred = lr_cf_tuned.predict_proba(X_train_u)[:, 1]
auc_train = roc_auc_score(y_train_u_r, y_pred)
cv_res = cross_validate(estimator=lr_cf_tuned, X=X_train_u, y=y_train_u_r, cv=5, scoring=scoring)
auc_cv = cv_res["test_AUC"].mean()
acc_cv = cv_res["test_Accuracy"].mean()
df_tuned_r.loc[len(df_tuned_r)] = [lr_cf, auc_train, auc_cv, acc_cv]
param_grid = [{
'C': [0.1, 1, 10],
'gamma': ['auto', 'scale']
}]
svm_cf_tuned = SVC(probability=True, random_state=42, kernel="rbf")
gs = GridSearchCV(svm_cf_tuned, param_grid=param_grid, scoring="roc_auc", cv=5, n_jobs=-1)
gs.fit(X_train_u, y_train_u_r)
gs.best_params_
{'C': 10, 'gamma': 'auto'}
svm_cf_tuned = gs.best_estimator_
svm_cf_tuned.fit(X_train_u, y_train_u_r)
y_pred = svm_cf_tuned.predict_proba(X_train_u)[:, 1]
auc_train = roc_auc_score(y_train_u_r, y_pred)
cv_res = cross_validate(estimator=svm_cf_tuned, X=X_train_u, y=y_train_u_r, cv=5, scoring=scoring, n_jobs=-1)
auc_cv = cv_res["test_AUC"].mean()
acc_cv = cv_res["test_Accuracy"].mean()
df_tuned_r.loc[len(df_tuned_r)] = [svm_cf_tuned, auc_train, auc_cv, acc_cv]
model_params = {
'n_estimators' : np.linspace(100, 400, 4, dtype=int),
'max_depth': np.linspace(5, 20, 4, dtype=int),
'learning_rate' : np.linspace(0.1, 1, 4, dtype=int)
}
xgb_cf_tuned = XGBClassifier(use_label_encoder=False)
rs = RandomizedSearchCV(xgb_cf_tuned, param_distributions=model_params, scoring="roc_auc", cv=5, n_jobs=-1)
rs.fit(X_train_u, y_train_u_r)
rs.best_params_
{'n_estimators': 200, 'max_depth': 10, 'learning_rate': 1}
param_grid = [{
"n_estimators":[100, 200, 300],
"max_depth": [5, 10],
"learning_rate": [0.1, 0.5]
}]
xgb_cf_tuned = XGBClassifier(use_label_encoder=False)
gs = GridSearchCV(xgb_cf_tuned, param_grid=param_grid, scoring="roc_auc", cv=5, n_jobs=-1)
gs.fit(X_train_u, y_train_u_r)
gs.best_params_
{'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100}
xgb_cf = rs.best_estimator_
xgb_cf_tuned.fit(X_train_u, y_train_u_r)
y_pred = xgb_cf_tuned.predict_proba(X_train_u)[:, 1]
auc_train = roc_auc_score(y_train_u_r, y_pred)
cv_res = cross_validate(estimator=xgb_cf_tuned, X=X_train_u, y=y_train_u_r, cv=5, scoring=scoring)
auc_cv = cv_res["test_AUC"].mean()
acc_cv = cv_res["test_Accuracy"].mean()
df_tuned_r.loc[len(df_tuned_r)] = [xgb_cf_tuned, auc_train, auc_cv, acc_cv]
Mit 5-facher Kreuzvalidierung wurden Scores zur Bewertung der unterschiedlichen optimierten Modelle berechnet.
Fazit Tuning: Evaluation auf den Testdaten muss zeigen, welches Prognosemodell den Status functional needs repair am zuverlässigsten vorraussagen kann. Ein weitere Einflussfaktor für die Modellauswahl sind jedoch auch die "Trainingskosten" (in diesem Kontext: Zeit) der Modelle, hier schneidet die SVM im Vergleich schlechter ab als die anderen Optionen.
display(df_tuned_r)
| Model | AUC Train | AUC CV | ACC CV | |
|---|---|---|---|---|
| 0 | RandomForestClassifier(bootstrap=False, max_de... | 0.967892 | 0.911201 | 0.825239 |
| 1 | LogisticRegression(random_state=42) | 0.922846 | 0.918654 | 0.826580 |
| 2 | SVC(C=10, gamma='auto', probability=True, rand... | 0.931920 | 0.918766 | 0.831198 |
| 3 | XGBClassifier(base_score=0.5, booster='gbtree'... | 0.990137 | 0.914013 | 0.812576 |
Ordinal Encoding erzeugt wesentlich weniger Features als OneHot Encoding, suggeriert jedoch auch eine Reihung der kategorischen Feature-Ausprägungen, was nicht immer im Sinne der Daten ist.
Wirkt sich Ordinal Encoding positiv auf die Performance des Modells aus? Ein Experiment mit Random Forest.
param_grid = [{
'n_estimators' : [50, 100, 150],
'max_depth': [5, 10, 15],
'min_samples_split' : [20, 25, 30],
'min_samples_leaf' : [1, 2],
'bootstrap' : [False]
}]
rf_cf_lbl = RandomForestClassifier(random_state=42)
gs = GridSearchCV(rf_cf_lbl, param_grid=param_grid, scoring="roc_auc", cv=5, n_jobs=-1)
gs.fit(X_train_u_lbl, y_train_u_r)
gs.best_params_
{'bootstrap': False,
'max_depth': 10,
'min_samples_leaf': 1,
'min_samples_split': 25,
'n_estimators': 150}
rf_cf_lbl = gs.best_estimator_
rf_cf_lbl.fit(X_train_u_lbl, y_train_u_r)
y_pred = rf_cf_lbl.predict_proba(X_train_u_lbl)[:, 1]
auc_train = roc_auc_score(y_train_u_r, y_pred)
cv_res = cross_validate(estimator=rf_cf_lbl, X=X_train_u_lbl, y=y_train_u_r, cv=5, scoring=scoring)
auc_cv = cv_res["test_AUC"].mean()
acc_cv = cv_res["test_Accuracy"].mean()
print(f"{str(rf_cf_lbl)} \nAUC Train: {auc_train}\nAUC CV: {auc_cv}\nACC CV: {acc_cv}")
RandomForestClassifier(bootstrap=False, max_depth=10, min_samples_split=25,
n_estimators=150, random_state=42)
AUC Train: 0.9537723338272335
AUC CV: 0.9183492688465998
ACC CV: 0.827623389147015
Fazit Ordinal Encoding: Das Modell unterscheidet sich nur unwesentlich von dem zuvor trainierten Random Forest. Es ist keine wesentliche Verbesserung durch die Änderung des Encodings der kategoriellen Features festzustellen.
Aufgrund der vielen Informationen, welche zu den Brunnen verfügbar sind, ist anzunehmen, dass nicht alle Features für die Klassifizierung relevant sind. Ein Experiment mit einem Random Forest soll überprüfen, ob auch nur mit einer bewussten Auswahl von Features gute Prognosen erzielt werden können.
rf_cf_red = RandomForestClassifier(random_state=42)
rf_cf_red.fit(X_train_u_lbl_reduced, y_train_u_r)
y_pred = rf_cf_red.predict_proba(X_train_u_lbl_reduced)[:, 1]
auc_train = roc_auc_score(y_train_u_r, y_pred)
cv_res = cross_validate(estimator=rf_cf_red, X=X_train_u_lbl, y=y_train_u_r, cv=5, scoring=scoring)
auc_cv = cv_res["test_AUC"].mean()
acc_cv = cv_res["test_Accuracy"].mean()
print(f"{str(rf_cf_red)} \nAUC Train: {auc_train}\nAUC CV: {auc_cv}\nACC CV: {acc_cv}")
RandomForestClassifier(random_state=42) AUC Train: 0.994516385716579 AUC CV: 0.8994907783381831 ACC CV: 0.8095965945849374
Fazit Reduced Features: Das entsprechende Modell performt merklich schlechter als zuvor erzielte Ergebnisse. Durch die ausgeschlossenen Features lies sich zwar die Dimension des Problems verkleinern, jedoch scheinen zu viele Informationen verloren zu gehen.
df_results = pd.DataFrame(data=[], columns=["Model", "AUC Train", "AUC CV"])
lr_cf = LogisticRegression(multi_class="ovr", max_iter=1000)
lr_cf.fit(X_train, y_train)
y_pred = lr_cf.predict_proba(X_train)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=lr_cf, X=X_train, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()
df_results.loc[len(df_results)] = [lr_cf, auc_train, auc_cv]
svm_cf = SVC(probability=True, decision_function_shape="ovr")
svm_cf.fit(X_train, y_train)
y_pred = svm_cf.predict_proba(X_train)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=svm_cf, X=X_train, y=y_train, cv=5, scoring='roc_auc_ovr', n_jobs=-1)
auc_cv = cv_res["test_score"].mean()
df_results.loc[len(df_results)] = [svm_cf, auc_train, auc_cv]
knn_cf = KNeighborsClassifier()
knn_cf.fit(X_train, y_train)
y_pred = knn_cf.predict_proba(X_train)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=knn_cf, X=X_train, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()
df_results.loc[len(df_results)] = [knn_cf, auc_train, auc_cv]
dt_cf = DecisionTreeClassifier()
dt_cf.fit(X_train, y_train)
y_pred = dt_cf.predict_proba(X_train)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=dt_cf, X=X_train, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()
df_results.loc[len(df_results)] = [dt_cf, auc_train, auc_cv]
rf_cf = RandomForestClassifier(random_state=42)
rf_cf.fit(X_train, y_train)
y_pred = rf_cf.predict_proba(X_train)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=rf_cf, X=X_train, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()
df_results.loc[len(df_results)] = [str(rf_cf), auc_train, auc_cv]
xgb_cf = XGBClassifier()
xgb_cf.fit(X_train, y_train)
y_pred = xgb_cf.predict_proba(X_train)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=xgb_cf, X=X_train, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()
df_results.loc[len(df_results)] = [xgb_cf, auc_train, auc_cv]
df_results
| Model | AUC Train | AUC CV | |
|---|---|---|---|
| 0 | LogisticRegression(max_iter=1000, multi_class=... | 0.869001 | 0.866267 |
| 1 | SVC(probability=True) | 0.912181 | 0.893104 |
| 2 | KNeighborsClassifier() | 0.957829 | 0.855479 |
| 3 | DecisionTreeClassifier() | 1.000000 | 0.751493 |
| 4 | RandomForestClassifier(random_state=42) | 1.000000 | 0.902239 |
| 5 | XGBClassifier(base_score=0.5, booster='gbtree'... | 0.964362 | 0.925640 |
Am besten performen die Ensemble-Modelle Random Forest und XGBoost. Der Decision Tree overfittet stark, die SVM wird nachfolgend aufgrund der Laufzeit ignoriert. Random Forest und XGBoost overfitten auch etwas, Verbesserungen durch Hyperparameter Tuning sind wahrscheinlich.
df_tuned = pd.DataFrame(data=[], columns=["Model", "AUC Train", "AUC CV"])
rf_cf = RandomForestClassifier(random_state=42)
model_params = {
'n_estimators': np.linspace(10,500,50,dtype=int),
'max_depth': np.linspace(2,30,15,dtype=int),
}
rs = RandomizedSearchCV(rf_cf, model_params, n_iter=20, scoring="roc_auc_ovr", cv=5, n_jobs=-1, random_state=42)
rs.fit(X_train, y_train)
rs.best_params_
{'n_estimators': 130, 'max_depth': 16}
param_grid = [{
"n_estimators":[300, 350, 400, 450, 500, 550],
"max_depth": [16, 18, 20],
}]
rf_cf_opt = RandomForestClassifier(random_state=42)
gs = GridSearchCV(rf_cf_opt, param_grid=param_grid, scoring="roc_auc_ovr", cv=5, n_jobs=-1)
gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)
0.916088281138068
{'max_depth': 18, 'n_estimators': 550}
display(pd.DataFrame(gs.cv_results_)[["param_max_depth", "param_n_estimators", "mean_test_score", "rank_test_score"]])
| param_max_depth | param_n_estimators | mean_test_score | rank_test_score | |
|---|---|---|---|---|
| 0 | 16 | 300 | 0.915558 | 12 |
| 1 | 16 | 350 | 0.915649 | 11 |
| 2 | 16 | 400 | 0.915800 | 9 |
| 3 | 16 | 450 | 0.915867 | 7 |
| 4 | 16 | 500 | 0.915923 | 4 |
| 5 | 16 | 550 | 0.915915 | 6 |
| 6 | 18 | 300 | 0.915744 | 10 |
| 7 | 18 | 350 | 0.915837 | 8 |
| 8 | 18 | 400 | 0.915943 | 3 |
| 9 | 18 | 450 | 0.915918 | 5 |
| 10 | 18 | 500 | 0.915974 | 2 |
| 11 | 18 | 550 | 0.916088 | 1 |
| 12 | 20 | 300 | 0.914980 | 18 |
| 13 | 20 | 350 | 0.915065 | 17 |
| 14 | 20 | 400 | 0.915187 | 14 |
| 15 | 20 | 450 | 0.915175 | 15 |
| 16 | 20 | 500 | 0.915146 | 16 |
| 17 | 20 | 550 | 0.915264 | 13 |
Finetuning bringt nur geringe Verbesserungen
rf_cf_opt = gs.best_estimator_
# rf_cf_opt = RandomForestClassifier(random_state=42, max_depth=18, n_estimators=550)
rf_cf_opt.fit(X_train, y_train)
y_pred = rf_cf_opt.predict_proba(X_train)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=rf_cf_opt, X=X_train, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()
df_tuned.loc[len(df_tuned)] = [f"Random Forest - optimized - {gs.best_params_}", auc_train, auc_cv]
xgb_cf = XGBClassifier()
model_params = {
'n_estimators': np.linspace(10,400,50,dtype=int),
'max_depth': np.linspace(2,30,15,dtype=int),
'learning_rate': np.linspace(0.1,1,10)
}
rs = RandomizedSearchCV(xgb_cf, model_params, n_iter=15, scoring="roc_auc_ovr", cv=5, n_jobs=-1, random_state=42)
rs.fit(X_train, y_train)
rs.best_params_
{'n_estimators': 344, 'max_depth': 4, 'learning_rate': 0.5}
param_grid = [{
"n_estimators":[250, 300, 350],
"max_depth": [4, 6, 8, 10],
"learning_rate": [0.1, 0.5, 1]
}]
xgb_cf = XGBClassifier()
gs = GridSearchCV(xgb_cf, param_grid=param_grid, scoring="roc_auc_ovr", cv=5, n_jobs=-1)
gs.fit(X_train, y_train)
gs.best_params_
{'learning_rate': 0.1, 'max_depth': 8, 'n_estimators': 250}
xgb_cf_opt = gs.best_estimator_
xgb_cf_opt.fit(X_train, y_train)
y_pred = xgb_cf_opt.predict_proba(X_train)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=xgb_cf_opt, X=X_train, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()
df_tuned.loc[len(df_tuned)] = [f"XGBoost - optimized - {gs.best_params_}", auc_train, auc_cv]
df_tuned
| Model | AUC Train | AUC CV | |
|---|---|---|---|
| 0 | Random Forest - optimized - {'max_depth': 18, ... | 0.992420 | 0.916088 |
| 1 | XGBoost - optimized - {'learning_rate': 0.1, '... | 0.978028 | 0.926525 |
Beide optimierten Modelle stellen eine Verbesserung gegenüber ihren Default-Varianten dar.
Aktuell wird mit sehr vielen Features trainiert (encodiert: 96 Spalten). Viele davon sind vorrangig mit 0en oder 1en befüllt (One-Hot und Binary Encoding). Der Feature-Raum kann also durch Dimensionsreduktion höchstwahrscheinlich noch stark komprimiert werden (zur Vermeidung des Fluchs der Dimensionalität, schnellerem Training, ...)
Lösung: PCA
pca_pl = PCA(n_components=len(X_train.T))
pca_pl.fit(X_train)
plt.plot(np.arange(pca_pl.n_components_) + 1, np.cumsum(pca_pl.explained_variance_ratio_), linewidth=1, color='red')
plt.title('Elbow Plot')
plt.xlabel('Principal Components')
plt.ylabel('Explained Variance Ratio')
plt.hlines(0.95, -1, 100, linestyle="--", linewidth=1, color="grey")
plt.vlines(40, 0, 1.1, linestyle="--", linewidth=1, color="grey")
plt.xlim(0,96)
plt.ylim(0,1.1)
plt.show()
Kein klarer "elbow" erkennbar, Beibehaltung von 95% der Varianz scheint sinnvoll, d.h. ca. 40 Komponenten
pca = PCA(n_components=40)
pca.fit(X_train)
X_train_ = pca.transform(X_train)
rf_cf = RandomForestClassifier(random_state=42)
rf_cf.fit(X_train_, y_train)
y_pred = rf_cf.predict_proba(X_train_)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=rf_cf, X=X_train_, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()
print(f"{auc_train}, {auc_cv}")
1.0, 0.8747497544670162
xgb_cf = XGBClassifier()
xgb_cf.fit(X_train_, y_train)
y_pred = xgb_cf.predict_proba(X_train_)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=xgb_cf, X=X_train_, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()
print(f"{auc_train}, {auc_cv}")
0.9699099128236064, 0.90438109044224
Die Default-Modelle performen nun schlechter als zuvor (vielleicht wurden doch wichtige Informationen entfernt?), der Ansatz mit PCA wird deshalb nachfolgend nicht mehr weiter verfolgt.
Wie in der EDA sichtbar wurde, ist auch beim multinomialen Klassifikationsproblem eine Imbalance zwischen den 3 Klassen vorhanden. Eine Möglichkeit hiermit umzugehen, wurde oben bereits gezeigt (Undersampling). Hier wird nun eine andere Methode verwendet: class bzw. sample weights. Es werden Gewichte für jede Klasse berechnet, die in die Loss-Funktion beim Training eingehen. Für die kleinere Klasse werden höhere Gewichte gewählt, so dass deren Wichtigkeit zunimmt.
classes_weights = list(class_weight.compute_class_weight('balanced', classes=np.unique(y_train), y=y_train))
weights = np.ones(y_train.shape[0], dtype='float')
for i, val in enumerate(y_train):
weights[i] = classes_weights[val-1]
xgb_cf_w = XGBClassifier()
xgb_cf_w.fit(X_train, y_train, sample_weight=weights);
y_pred = xgb_cf_w.predict_proba(X_train)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=xgb_cf_w, X=X_train, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()
print(f"{auc_train}, {auc_cv}")
0.9594796427886951, 0.9256401412787248
param_grid = [{
"n_estimators":[100, 150, 200, 250, 300],
"max_depth": [4, 6, 8, 10],
"learning_rate": [0.1, 0.5, 1]
}]
xgb_cf = XGBClassifier()
gs = GridSearchCV(xgb_cf, param_grid=param_grid, scoring="roc_auc_ovr", cv=5, n_jobs=-1)
gs.fit(X_train, y_train, sample_weight=weights)#fit_params={'sample_weight': weights})
xgb_cf_w_opt = gs.best_estimator_
xgb_cf_w_opt.fit(X_train, y_train, sample_weight=weights)
y_pred = xgb_cf_w_opt.predict_proba(X_train)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=xgb_cf_w_opt, X=X_train, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()
df_tuned.loc[len(df_tuned)] = [f"XGBoost - weights -optimized {gs.best_params_}", auc_train, auc_cv]
rf_cf_w = RandomForestClassifier(random_state=42)
rf_cf_w.fit(X_train, y_train, sample_weight=weights)
y_pred = rf_cf_w.predict_proba(X_train)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=rf_cf_w, X=X_train, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()
print(f"{auc_train}, {auc_cv}")
0.9999999997066454, 0.9022388708551381
param_grid = [{
"n_estimators":[300, 350, 400, 450, 500, 550],
"max_depth": [16, 18, 20],
}]
rf_cf_w_opt = RandomForestClassifier(random_state=42)
gs = GridSearchCV(rf_cf_w_opt, param_grid=param_grid, scoring="roc_auc_ovr", cv=5, n_jobs=-1)
gs.fit(X_train, y_train, sample_weight=weights)#fit_params={'sample_weight': weights})
rf_cf_w_opt = gs.best_estimator_
rf_cf_w_opt.fit(X_train, y_train, sample_weight=weights)
y_pred = rf_cf_w_opt.predict_proba(X_train)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=rf_cf_w_opt, X=X_train, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()
df_tuned.loc[len(df_tuned)] = [f"Random Forest - weights -optimized {gs.best_params_}", auc_train, auc_cv]
In beiden Fällen wird die AUC nicht signifikant besser. Durch Optimierung mit roc_auc_ovr als Metrik wird bereits zuvor eine gute Vorhersage auch für kleinere Klassen in gewisser Weise erzwungen.
Das Konstruktionsjahr wurde bislang ignoriert. In den nachfolgenden Fällen geht dieses Feature als Alter des Brunnens (konstruiiert aus date_recorded - construction_year) in das Modell mit ein.
rf_cf = RandomForestClassifier(random_state=42)
rf_cf.fit(X_train_w_age , y_train)
y_pred = rf_cf.predict_proba(X_train_w_age)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=rf_cf, X=X_train_w_age, y=y_train, cv=5, scoring='roc_auc_ovr', n_jobs=-1)
auc_cv = cv_res["test_score"].mean()
print(f'{auc_train}, {auc_cv}')
0.9999999943007573, 0.903358859215191
Kleine Verbesserung, deshalb Hyperparamter Optimierung für dieses Modell.
param_grid = [{
"n_estimators":[400, 450, 500, 550],
"max_depth": [16, 18, 20],
}]
rf_cf = RandomForestClassifier(random_state=42)
gs = GridSearchCV(rf_cf, param_grid=param_grid, scoring="roc_auc_ovr", cv=5, n_jobs=-1)
gs.fit(X_train_w_age, y_train)
GridSearchCV(cv=5, estimator=RandomForestClassifier(random_state=42), n_jobs=-1,
param_grid=[{'max_depth': [16, 18, 20],
'n_estimators': [400, 450, 500, 550]}],
scoring='roc_auc_ovr')
rf_cf_opt_age = gs.best_estimator_
rf_cf_opt_age.fit(X_train_w_age, y_train)
y_pred = rf_cf_opt_age.predict_proba(X_train_w_age)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=rf_cf_opt_age, X=X_train_w_age, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()
df_tuned.loc[len(df_tuned)] = [f"Random Forest - with Age - optimized - {gs.best_params_}", auc_train, auc_cv]
xgb_cf = XGBClassifier()
xgb_cf.fit(X_train_w_age, y_train)
y_pred = xgb_cf.predict_proba(X_train_w_age)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=xgb_cf, X=X_train_w_age, y=y_train, cv=5, scoring='roc_auc_ovr', n_jobs=-1)
auc_cv = cv_res["test_score"].mean()
print(f'{auc_train}, {auc_cv}')
0.9665996207049984, 0.9261132067285827
param_grid = [{
"n_estimators":[250, 300, 350],
"max_depth": [4, 6, 8, 10],
"learning_rate": [0.1, 0.5, 1]
}]
xgb_cf = XGBClassifier()
gs = GridSearchCV(xgb_cf, param_grid=param_grid, scoring="roc_auc_ovr", cv=5, n_jobs=-1)
gs.fit(X_train_w_age, y_train)
gs.best_params_
{'learning_rate': 0.1, 'max_depth': 8, 'n_estimators': 300}
xgb_cf_opt_age = gs.best_estimator_
xgb_cf_opt_age.fit(X_train_w_age, y_train)
y_pred = xgb_cf_opt_age.predict_proba(X_train_w_age)
auc_train = roc_auc_score(y_train, y_pred, multi_class='ovr')
cv_res = cross_validate(estimator=xgb_cf_opt_age, X=X_train_w_age, y=y_train, cv=5, scoring='roc_auc_ovr')
auc_cv = cv_res["test_score"].mean()
df_tuned.loc[len(df_tuned)] = [f"XGBoost - with Age - optimized {gs.best_params_}", auc_train, auc_cv]
df_tuned
| Model | AUC Train | AUC CV | |
|---|---|---|---|
| 0 | Random Forest - optimized - {'max_depth': 18, ... | 0.992420 | 0.916088 |
| 1 | XGBoost - optimized - {'learning_rate': 0.1, '... | 0.978028 | 0.926525 |
| 2 | XGBoost - weights -optimized {'learning_rate':... | 0.978048 | 0.926460 |
| 3 | Random Forest - weights -optimized {'max_depth... | 0.993145 | 0.915264 |
| 4 | Random Forest - with Age - optimized - {'max_d... | 0.992760 | 0.916654 |
| 5 | XGBoost - with Age - optimized {'learning_rate... | 0.983787 | 0.927360 |
Out-of-Sample Evaluation der besten Modelle mit einem separat preparierten Test-Set.
df_pump_test.head()
| id | amount_tsh | date_recorded | funder | gps_height | installer | longitude | latitude | wpt_name | num_private | ... | water_quality | quality_group | quantity | quantity_group | source | source_type | source_class | waterpoint_type | waterpoint_type_group | status_group | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 69572 | 6000.0 | 2011-03-14 | Roman | 1390 | Roman | 34.938093 | -9.856322 | none | 0 | ... | soft | good | enough | enough | spring | spring | groundwater | communal standpipe | communal standpipe | functional |
| 1 | 8776 | 0.0 | 2013-03-06 | Grumeti | 1399 | GRUMETI | 34.698766 | -2.147466 | Zahanati | 0 | ... | soft | good | insufficient | insufficient | rainwater harvesting | rainwater harvesting | surface | communal standpipe | communal standpipe | functional |
| 2 | 49056 | 0.0 | 2011-02-20 | Private | 62 | Private | 39.209518 | -7.034139 | Mzee Hokororo | 0 | ... | salty | salty | enough | enough | machine dbh | borehole | groundwater | other | other | functional |
| 3 | 58155 | 0.0 | 2011-09-04 | Unicef | 1656 | DWE | 34.569266 | -9.085515 | Kwa Rose Chaula | 0 | ... | soft | good | dry | dry | river | river/lake | surface | communal standpipe | communal standpipe | non functional |
| 4 | 34169 | 0.0 | 2011-07-22 | Hesawa | 1162 | DWE | 32.920154 | -1.947868 | Ngomee | 0 | ... | milky | milky | insufficient | insufficient | spring | spring | groundwater | other | other | functional needs repair |
5 rows × 41 columns
df_pump_test.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 8910 entries, 0 to 8909 Data columns (total 41 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 8910 non-null int64 1 amount_tsh 8910 non-null float64 2 date_recorded 8910 non-null object 3 funder 8369 non-null object 4 gps_height 8910 non-null int64 5 installer 8365 non-null object 6 longitude 8910 non-null float64 7 latitude 8910 non-null float64 8 wpt_name 8910 non-null object 9 num_private 8910 non-null int64 10 basin 8910 non-null object 11 subvillage 8854 non-null object 12 region 8910 non-null object 13 region_code 8910 non-null int64 14 district_code 8910 non-null int64 15 lga 8910 non-null object 16 ward 8910 non-null object 17 population 8910 non-null int64 18 public_meeting 8422 non-null object 19 recorded_by 8910 non-null object 20 scheme_management 8302 non-null object 21 scheme_name 4686 non-null object 22 permit 8445 non-null object 23 construction_year 8910 non-null int64 24 extraction_type 8910 non-null object 25 extraction_type_group 8910 non-null object 26 extraction_type_class 8910 non-null object 27 management 8910 non-null object 28 management_group 8910 non-null object 29 payment 8910 non-null object 30 payment_type 8910 non-null object 31 water_quality 8910 non-null object 32 quality_group 8910 non-null object 33 quantity 8910 non-null object 34 quantity_group 8910 non-null object 35 source 8910 non-null object 36 source_type 8910 non-null object 37 source_class 8910 non-null object 38 waterpoint_type 8910 non-null object 39 waterpoint_type_group 8910 non-null object 40 status_group 8910 non-null object dtypes: float64(3), int64(7), object(31) memory usage: 2.8+ MB
#Selbe Transformation wie auf den Trainingsdaten
df_test = df_pump_test.copy()
df_test["t_functional"] = df_test["status_group"].apply(lambda x: 0 if x=="non functional" else 1)
df_test["t_needs_rep"] = df_test["status_group"].apply(lambda x: 1 if x=="functional needs repair" else 0)
df_test = df_test.drop(df_test[df_test.longitude == 0].index)
df_test["amount_per_capita"] = df_test.apply(lambda x: x["amount_tsh"]/x["population"] if x["population"] != 0 else 0, axis=1)
df_test["age"]=df_test['date_recorded'].apply(lambda x: datetime.datetime.strptime(x,'%Y-%m-%d'))-df_test["construction_year"].apply(lambda x: datetime.datetime.strptime(str(x),'%Y') if x!=0 else datetime.datetime.now())
df_test["age"].loc[df_test["construction_year"]!=0]=df_test["age"].loc[df_test["construction_year"]!=0].apply(lambda x: x.days)
df_test["age"].loc[df_test["construction_year"]==0]=np.nan
df_test["district_code"] = df_test["district_code"].astype("category")
df_test["region_code"] = df_test["region_code"].astype("category")
y_test = df_test.loc[:,'status_group']
le = LabelEncoder()
y_test = le.fit_transform(y_test)
y_test_f = df_test["t_functional"]
y_test_r = df_test["t_needs_rep"]
X_test = col_trans.transform(X=df_test)
X_test_lbl = col_trans_lbl.transform(X=df_test)
X_test_w_age = col_trans_w_age.transform(X=df_test)
X_test_lbl_w_age = col_trans_lbl_w_age.transform(X=df_test)
X_test_lbl_reduced = col_trans_lbl_reduced.transform(X=df_test)
X_test_u = col_trans_u.transform(X=df_test)
X_test_lbl_w_age_u = col_trans_lbl_w_age_u.transform(X=df_test)
X_test_lbl_reduced_u = col_trans_lbl_reduced_u.transform(X=df_test)
print(X_test.shape)
print(X_test_lbl.shape)
print(X_test_w_age.shape)
print(X_test_lbl_w_age.shape)
print(X_test_lbl_reduced.shape)
print(X_test_u.shape)
print(X_test_lbl_w_age_u.shape)
print(X_test_lbl_reduced_u.shape)
print(f'({X_test.min()}, {X_test.max()})')
print(f'({X_test_lbl.min()}, {X_test_lbl.max()})')
print(f'({X_test_w_age.min()}, {X_test_w_age.max()})')
print(f'({X_test_lbl_w_age.min()}, {X_test_lbl_w_age.max()})')
print(f'({X_test_lbl_reduced.min()}, {X_test_lbl_reduced.max()})')
print(f'({X_test_u.min()}, {X_test_u.max()})')
print(f'({X_test_lbl_w_age_u.min()}, {X_test_lbl_w_age_u.max()})')
print(f'({X_test_lbl_reduced_u.min()}, {X_test_lbl_reduced_u.max()})')
(8630, 95) (8630, 25) (8630, 96) (8630, 26) (8630, 16) (8630, 92) (8630, 26) (8630, 16) (-9.977452173104417e-05, 1.0) (-9.977452173104417e-05, 1.0) (-9.977452173104417e-05, 1.0) (-9.977452173104417e-05, 1.0) (0.0, 1.0) (-0.0028318584070796474, 1.8115942028985506) (-0.0028318584070796474, 1.8115942028985506) (0.0, 1.8115942028985506)
from yellowbrick.classifier import ROCAUC
def plot_ROC(model, X_train, y_train, X_test, y_test):
vis = ROCAUC(model, encoder={0: 'functional', 1: ' func_needs_rep', 2: 'non_functional'})
vis.fit(X_train, y_train)
vis.score(X_test, y_test)
vis.show()
return vis
from sklearn.metrics import plot_confusion_matrix
df_act_pred = pd.DataFrame(data=[], columns=["Model", "Actuals", "Predicted"])
plt.figure(0).clf()
#Tuned XGB
y_pred_f = xgb_cf_t.predict_proba(X_test)[:,1]
y_pred_f_bin = xgb_cf_t.predict(X_test)
fpr, tpr, thresh = metrics.roc_curve(y_test_f, y_pred_f)
auc = metrics.roc_auc_score(y_test_f, y_pred_f)
plt.plot(fpr,tpr,label="XG Boost Tuned, auc="+str(auc))
df_act_pred.loc[len(df_act_pred)] = ["XG Boost Tuned", y_test_f, y_pred_f_bin]
#tuned Random Forest
y_pred_f = rf_cf_t.predict_proba(X_test)[:,1]
y_pred_f_bin = rf_cf_t.predict(X_test)
fpr, tpr, thresh = metrics.roc_curve(y_test_f, y_pred_f)
auc = metrics.roc_auc_score(y_test_f, y_pred_f)
plt.plot(fpr,tpr,label="Tuned Random Forest Tuned, auc="+str(auc))
df_act_pred.loc[len(df_act_pred)] = ["Random Forest Tuned", y_test_f, y_pred_f_bin]
print("Random Forest Tuned")
print(classification_report(y_test_f, y_pred_f_bin))
#Tuned Decision Tree
y_pred_f = dt_cf_t.predict_proba(X_test)[:,1]
y_pred_f_bin = dt_cf_t.predict(X_test)
fpr, tpr, thresh = metrics.roc_curve(y_test_f, y_pred_f)
auc = metrics.roc_auc_score(y_test_f, y_pred_f)
plt.plot(fpr,tpr,label="Decision Tree Tuned, auc="+str(auc))
df_act_pred.loc[len(df_act_pred)] = ["Decision Tree Tuned", y_test_f, y_pred_f_bin]
plt.legend(loc=0)
plt.show()
Random Forest Tuned
precision recall f1-score support
0 0.86 0.73 0.79 3364
1 0.84 0.92 0.88 5266
accuracy 0.85 8630
macro avg 0.85 0.83 0.83 8630
weighted avg 0.85 0.85 0.84 8630
Die AUC Kurve auf den Testdaten zeigt solide Ergebnisse
Getunter Random Forest löst die 1. Fragestellung
fig, axes = plt.subplots(nrows = 1, ncols = df_act_pred.shape[0] , figsize=(df_act_pred.shape[0]*5,3))
for i, row in df_act_pred.iterrows():
y_test_f = row['Actuals']
y_pred_f_bin = row['Predicted']
confusion_matrix = metrics.confusion_matrix(y_test_f, y_pred_f_bin)
cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix = confusion_matrix, display_labels = [False, True])
cm_display.plot(ax = axes[i])
cm_display.ax_.set_title(row['Model'])
plt.show()
Die Verteilung innerhalb der Konfusion Matrix zeigt das selbe Ergebnis
lr_disp = RocCurveDisplay.from_estimator(lr_cf_tuned, X_test_u, y_test_r)
svm_disp = RocCurveDisplay.from_estimator(svm_cf_tuned, X_test_u, y_test_r, ax=lr_disp.ax_)
dt_disp = RocCurveDisplay.from_estimator(dt_cf_, X_test_u, y_test_r, ax=lr_disp.ax_)
rf_disp = RocCurveDisplay.from_estimator(rf_cf_tuned, X_test_u, y_test_r, ax=lr_disp.ax_)
rf_lbl_disp = RocCurveDisplay.from_estimator(rf_cf_lbl, X_test_lbl_w_age_u, y_test_r, ax=lr_disp.ax_)
xgb_disp = RocCurveDisplay.from_estimator(xgb_cf_tuned, X_test_u, y_test_r, ax=lr_disp.ax_)
lr_disp.figure_.suptitle("ROC curve comparison")
lr_disp.figure_.set_size_inches(8, 6)
plt.show()
Fazit Modellauswahl mit Testdaten: Bei der Evaluation mit den Testdaten wird klar, dass der Random Forest den höchsten AUC-Score liefern kann. Auch die SVM kann gut mithalten, allerdings ist sie aufgrund der "Trainingskosten" zusätzlich zu benachteiligen.
Welche Features waren wichtig für die Entscheidungen des Random Forest?
Für den Ordinal Encoding Random Forest kann der Einfluss von unterschiedlichen Features auf das Modell betrachtet werden. Es lassen sich ausgehend von im Business Understanding getroffenen Annahmen, welche Features vermutlich auf die Funktionalität eines Brunnens Einfluss haben könnten, einige Überraschungen feststellen.
feature_names = ['funder', 'installer', 'subvillage', 'lga', 'ward', 'management', 'basin', 'payment','water_quality', 'quantity',
'source','waterpoint_type', 'extraction_type_class', 'region', 'region_code', 'district_code', 'wpt_name',
"amount_tsh", "population", "amount_per_capita", 'gps_height', 'longitude', 'latitude', "scheme_management",
"date_recorded", "age"]
feature_scores = pd.Series(rf_cf_lbl.feature_importances_, index=feature_names).sort_values(ascending=False)
f, ax = plt.subplots(figsize=(12, 8))
ax = sns.barplot(x=feature_scores, y=feature_scores.index)
ax.set_title("Feature Importance (Gain)")
ax.set_yticklabels(feature_scores.index)
ax.set_xlabel("Feature importance score")
ax.set_ylabel("Features")
plt.show()
from sklearn.metrics import confusion_matrix
predictions = rf_cf_tuned.predict(X_test_u)
cm = confusion_matrix(y_test_r, predictions, labels=rf_cf_tuned.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['needs_no_rep', 'func_needs_rep'])
disp.plot()
plt.title('RandomForestClassifier')
plt.show()
print('RandomForestClassifier: \n',classification_report(y_test_r, predictions))
RandomForestClassifier:
precision recall f1-score support
0 0.98 0.77 0.86 8055
1 0.19 0.78 0.31 575
accuracy 0.77 8630
macro avg 0.59 0.77 0.59 8630
weighted avg 0.93 0.77 0.82 8630
Mit dem Modell werden viele der reparaturbedürftigen Brunnen erkannt (nur 126 nicht, recall für 1 von 0.78). Dies geht jedoch auf Kosten der Precision für diese Klasse. Nachdem man jedoch tendenziell eher alle reparaturbedürftigen Brunnen erkennen möchte, ist dies annehmbar.
Fazit Evaluation gewähltes Modell: Wie bereits in der Data Exploration und in der Data Preparation klar wurde, besteht für die Ausprägung status_group = functional needs repair ein starkes Ungleichgewicht in den Daten. Aufgrund dessen kann hier im Vergleich zu Prognosemodell 1 kein gleich hoher AUC Score erreicht werden.
Aus wirtschaftlicher Sicht muss sich die Frage gestellt werden, ob eine Optimierung nach einem anderen Score (z.B. Recall, wenn möglichst alle wartungsbedürftigen Brunnen erkannt werden sollen) dem Ziel des Prognosemodells eher gerecht werden würde. Hier ist sind Aspekte wie der Precision-Recall Trade-off zu beachten.
auc_test_rf = roc_auc_score(y_test, rf_cf_opt.predict_proba(X_test), multi_class='ovr')
auc_test_xgb = roc_auc_score(y_test, xgb_cf_opt.predict_proba(X_test), multi_class='ovr')
auc_test_rf_w = roc_auc_score(y_test, rf_cf_w_opt.predict_proba(X_test), multi_class='ovr')
auc_test_xgb_w = roc_auc_score(y_test, xgb_cf_w_opt.predict_proba(X_test), multi_class='ovr')
auc_test_rf_age = roc_auc_score(y_test, rf_cf_opt_age.predict_proba(X_test_w_age), multi_class='ovr')
auc_test_xgb_age = roc_auc_score(y_test, xgb_cf_opt_age.predict_proba(X_test_w_age), multi_class='ovr')
df_tuned["AUC Test"] = [auc_test_rf, auc_test_xgb, auc_test_xgb_w, auc_test_rf_w, auc_test_rf_age, auc_test_xgb_age]
df_tuned
| Model | AUC Train | AUC CV | AUC Test | |
|---|---|---|---|---|
| 0 | Random Forest - optimized - {'max_depth': 18, ... | 0.992420 | 0.916088 | 0.899027 |
| 1 | XGBoost - optimized - {'learning_rate': 0.1, '... | 0.978028 | 0.926525 | 0.859736 |
| 2 | XGBoost - weights -optimized {'learning_rate':... | 0.978048 | 0.926460 | 0.863119 |
| 3 | Random Forest - weights -optimized {'max_depth... | 0.993145 | 0.915264 | 0.891627 |
| 4 | Random Forest - with Age - optimized - {'max_d... | 0.992760 | 0.916654 | 0.899959 |
| 5 | XGBoost - with Age - optimized {'learning_rate... | 0.983787 | 0.927360 | 0.863010 |
plot_ROC(rf_cf_opt_age, X_train_w_age, y_train, X_test_w_age, y_test);
fig = plt.figure(figsize=(14,5))
ax1 = fig.add_subplot(121)
plot_confusion_matrix(rf_cf_opt, X_test, y_test, ax=ax1)
ax2 = fig.add_subplot(122)
plot_confusion_matrix(rf_cf_w_opt, X_test, y_test, ax=ax2)
plt.show()
Das Deployment der Modelle ist nicht mehr Teil der Aufgabenstellung. Im Falle eines Deployments würden jedoch folgende Prognosemodelle für die 3 Fälle ausgewählt werden:
Allen drei Modellen liegt ein Random Forest zugrunde. Aufgrund der vielen kategoriellen Merkmale scheint diese Algorithmen-Wahl jedoch nicht verwunderlich.
https://towardsdatascience.com/the-easiest-way-to-plot-data-from-pandas-on-a-world-map-1a62962a27f3
https://www.scikit-yb.org/en/latest/api/classifier/rocauc.html#multi-class-rocauc-curves
https://plotly.com/python/scattermapbox/
https://imbalanced-learn.org/stable/under_sampling.html
https://contrib.scikit-learn.org/category_encoders/index.html